Identifying key phrases within documents

ABSTRACT

Systems are used for identifying key phrases within documents. These systems utilize tags and a tag index to determine what a document primarily relates to. For example, an integrated data flow and extract-transform-load pipeline crawls, parses, and word breaks large corpuses of documents in database tables. Documents can be broken into tuples. The tuples can be sent to a heuristically based algorithm that uses statistical language models and weight plus cross-entropy threshold functions to summarize the document into its “top N” most statistically significant phrases. These systems can scale efficiently (e.g., linearly), and potentially large numbers of documents can be characterized by salient and relevant key phrases (tags).

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 12/959,840 filed on Dec. 3, 2010 and entitled “IDENTIFYING KEY PHRASES WITHIN DOCUMENTS,” which issued as U.S. Pat. No. 8,423,546 on Apr. 16, 2013, and which application is expressly incorporated herein by reference in its entirety.

BACKGROUND

1. Background and Relevant Art

Computer systems and related technology affect many aspects of society. Indeed, the computer system's ability to process information has transformed the way we live and work. Computer systems now commonly perform a host of tasks (e.g., word processing, scheduling, accounting, etc.) that prior to the advent of the computer system were performed manually. More recently, computer systems have been coupled to one another and to other electronic devices to form both wired and wireless computer networks over which the computer systems and other electronic devices can transfer electronic data. Accordingly, the performance of many computing tasks is distributed across a number of different computer systems and/or a number of different computing environments.

For many organizations, documents easily comprise the largest information assets by volume. As such, characterizing a document by its salient features, such as, for example, its key words and phrases, is an important piece of functionality.

One technique for characterizing documents includes using full text search solutions that mine documents into full text inverted indices. Another technique for characterizing documents mines document level semantics (e.g., to identify similarities between documents). Proper implementation of either of these two techniques can require heavy investments in both computer hardware and personnel resources.

Further, document parsing, mining, etc. operations are often replicated across these two techniques. As such, an end user pays additional costs by having to invest in (perhaps as much as double) resources to reap the benefits of both search and semantic insight over their documents. Additionally, many more complex document mining techniques require integrating disparate systems together and lead to further costs in order to satisfy an organization's document processing needs.

BRIEF SUMMARY

The present invention extends to methods, systems, and computer program products for identifying key phrases in documents. In some embodiments, a document is accessed. The frequency of occurrence of a plurality of different textual phrases within the document is calculated. Each textual phrase includes one or more individual words of a specified language. A language model for the specified language is accessed. The language model defines expected frequencies of occurrence at least for individual words of the specified language.

For each textual phrase in the plurality of different textual phrases, a cross-entropy value is computed for the textual phrase. The cross-entropy value is computed from the frequency of occurrence of the textual phrase within the document and the frequency of occurrence of the textual phrase within the specified language. A specified number of statistically significant textual phrases from within the document are selected based on the computed cross-entropy values. A key phrase data structure is populated with data representative of each of the selected specified number of statistically significant textual phrases.

In other embodiments, a document containing a plurality of textual phrases is accessed. For each textual phrase in the plurality of textual phrases contained in the document, a location list is generated for the textual phrase. The location list indicates one or more locations of the textual phrase within the document. For each textual phrase in the plurality of textual phrases contained in the document, a score is assigned to the textual phrase. The score is based on the contents of the location list for the textual phrase relative to the occurrence of the textual phrase in a training set of data.

The plurality of textual phrases is ranked according to the assigned scores. A subset of the plurality of textual phrases is selected from within the document based on the rankings. A key phrase data structure is populated from the selected subset of the plurality of textual phrases.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to describe the manner in which the above-recited and other advantages and features of the invention can be obtained, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:

FIG. 1 illustrates an example computer architecture that facilitates identifying key phrases within documents.

FIG. 2 illustrates a flow chart of an example method for identifying key phrases within documents.

FIG. 3 illustrates an example computer architecture that facilitates identifying key phrases within documents.

FIG. 4 illustrates a flow chart of an example method for identifying key phrases within documents.

DETAILED DESCRIPTION

The present invention extends to methods, systems, and computer program products for identifying key phrases in documents. A document is accessed. The frequency of occurrence of a plurality of different textual phrases within the document is calculated. Each textual phrase includes one or more individual words of a specified language. A language model for the specified language is accessed. The language model defines expected frequencies of occurrence at least for individual words of the specified language.

For each textual phrase in the plurality of different textual phrases, a cross-entropy value is computed for the textual phrase. The cross-entropy value is computed from the frequency of occurrence of the textual phrase within the document and the frequency of occurrence of the textual phrase within the specified language. A specified number of statistically significant textual phrases from within the document are selected based on the computed cross-entropy values. A key phrase data structure is populated with data representative of each of the selected specified number of statistically significant textual phrases.

In other embodiments, a document containing a plurality of textual phrases is accessed. For each textual phrase in the plurality of textual phrases contained in the document, a location list is generated for the textual phrase. The location list indicates one or more locations of the textual phrase within the document. For each textual phrase in the plurality of textual phrases contained in the document, a score is assigned to the textual phrase. The score is based on the contents of the location list for the textual phrase relative to the occurrence of the textual phrase in a training set of data.

The plurality of textual phrases is ranked according to the assigned scores. A subset of the plurality of textual phrases is selected from within the document based on the rankings. A key phrase data structure is populated from the selected subset of the plurality of textual phrases.

Embodiments of the present invention may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present invention also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are computer storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the invention can comprise at least two distinctly different kinds of computer-readable media: computer storage media (devices) and transmission media.

Computer storage media (devices) includes RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.

A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.

Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to computer storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC”), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system. Thus, it should be understood that computer storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.

Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.

Those skilled in the art will appreciate that the invention may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, pagers, routers, switches, and the like. The invention may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.

In general, an integrated data flow and extract-transform-load pipeline crawls, parses, and word breaks large corpuses of documents in database tables. Documents can be broken into tuples. In some embodiments, the tuples are of the format {phrase, frequency}. A phrase can include one or more words and the frequency is the frequency of occurrence within a document. The tuples can be sent to a heuristically based algorithm that uses statistical language models and weight+cross-entropy threshold functions to summarize the document into its “top N” most statistically significant phrases.

Alternately, tuples can be of the format {phrase, location list}. The location list lists the locations of the phrase within a document. The tuples are sent to a Keyword Extraction Algorithm (“KEX”) to compute, potentially with a higher quality (e.g., less noisy phrases), a set of textually relevant tags. Accordingly, documents can be characterized by salient and relevant key phrases (tags).

When a plurality of documents is being processed, each tuple can also include a document ID.

FIG. 1 illustrates an example computer architecture 100 that facilitates identifying key phrases within documents. Referring to FIG. 1, computer architecture 100 includes database 101, frequency calculation module 102, cross-entropy calculation module 103, phrase selector 106, and key phrase data structure 107. Each of the depicted computer systems is connected to one another over (or is part of) a network, such as, for example, a Local Area Network (“LAN”), a Wide Area Network (“WAN”), and even the Internet. Accordingly, each of the depicted components as well as any other connected computer systems and their components, can create message related data and exchange message related data (e.g., Internet Protocol (“IP”) datagrams and other higher layer protocols that utilize IP datagrams, such as, Transmission Control Protocol (“TCP”), Hypertext Transfer Protocol (“HTTP”), Simple Mail Transfer Protocol (“SMTP”), etc.) over the network.

Database 101 can be virtually any type of database (e.g., a Structured Query Language (“SQL”) database or other relational database). As depicted, database 101 can contain one or more tables including table 109. Each table in database 101 can include one or more rows and one or more columns used to organize data, such as, for example, documents. For example, table 109 includes a plurality of documents including documents 112 and 122. Each document can be identified by a corresponding document ID. For example, document ID 111 can identify document 112, document ID 121 can identify document 122, etc.

Frequency calculation module 102 is configured to calculate the frequency of occurrence of a textual phrase within a document. Frequency calculation module 102 can receive a document as input. From the document, frequency calculation module 102 can calculate the frequency with which one or more textual phrases occur in the document. A textual phrase can include one or more words of a specified language. Frequency calculation module 102 can output a list of phrases and corresponding frequencies for a document.
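As a minimal sketch of the kind of output a frequency calculation module could produce, the following Python counts every phrase of one to max_words consecutive words and returns {phrase, frequency} pairs. The regular-expression word breaker, the function names, and the max_words limit are illustrative assumptions and are not details given in this description.

```python
import re
from collections import Counter


def word_break(text):
    """Hypothetical word breaker: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_frequencies(text, max_words=2):
    """Count every phrase of 1..max_words consecutive words in the text."""
    words = word_break(text)
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    doc = "The annual budget was approved. The annual budget grew."
    for phrase, frequency in phrase_frequencies(doc).items():
        print(phrase, frequency)
```

Note that this simple sketch lets phrases span sentence boundaries; a production word breaker would likely treat punctuation more carefully.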

In general, cross-entropy calculation module 103 is configured to calculate a cross-entropy between phrases in a specified document and the same phrases in a corresponding language model. Cross-entropy calculation module 103 can receive a list of one or more phrases and corresponding frequencies of occurrence for a document. Cross-entropy calculation module 103 can also receive a statistical language model. The statistical language model can include a plurality of words (or phrases) of a specified language and can define an expected frequency of occurrence for each of the plurality of words (or phrases) in the language.

Cross-entropy can measure the “amount of surprise” in the frequency of occurrence of a phrase in a specified document relative to the frequency of occurrence of the phrase in the language model. For example, a particular phrase can occur with more or less frequency in a specified document as compared to the language model. Thus, cross-entropy calculation module 103 can be configured to calculate the cross-entropy between the frequency of occurrence of a phrase in a specified document and the frequency of occurrence of the phrase in a language model.

In some embodiments, expected frequencies of occurrence represent how often a word (or phrase) generally occurs within the specific language. In other embodiments, expected frequencies of occurrence are adjusted for particular document domains, such as, for example, legal documents, medical documents, engineering documents, sports related documents, financial documents, etc.

When appropriate, combiner 104 can combine one or more words from a language model into a phrase contained in a document. For example, combiner 104 can combine the words ‘annual’ and ‘budget’ into “annual budget”. Combiner 104 can also compute a representative expected frequency for a phrase from expected frequencies for individual words included in the phrase. For example, combiner 104 can compute an expected frequency for “annual budget” from an expected frequency for ‘annual’ and an expected frequency for ‘budget’. Combiner 104 can include an algorithm for inferring (e.g., interpolating, extrapolating, etc.) an expected frequency for a phrase from a plurality of frequencies for individual words.
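The description does not specify how the combiner infers a phrase frequency. One simple possibility, shown here purely as an assumption, is to treat the words as independent and add their log10 probabilities. The function name, the fallback value for unseen words, and the sample probabilities are all hypothetical.

```python
def phrase_log_prob(phrase, word_log_probs, unknown_log_prob=-9.0):
    """Estimate log10 P(phrase) as the sum of per-word log10 probabilities.

    Assumes the words occur independently, which is a simplification.
    unknown_log_prob is a hypothetical fallback for words missing from the model.
    """
    return sum(word_log_probs.get(word, unknown_log_prob) for word in phrase.split())


if __name__ == "__main__":
    # Made-up expected log10 probabilities, for illustration only.
    model = {"annual": -4.0, "budget": -4.5}
    print(phrase_log_prob("annual budget", model))
```

Under the independence assumption, a phrase is always at least as rare as its rarest word, so real phrases that co-occur often (such as "annual budget") would tend to be underestimated; interpolation against observed phrase counts is one way such an estimate might be refined.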

Cross-entropy calculation module 103 can output a list of one or more phrases and corresponding cross entropies.

Phrase selection module 106 is configured to select phrases for inclusion in a key phrase data structure for a document. Phrase selection module 106 can receive a list of one or more phrases and corresponding cross entropies. Phrase selection module 106 can also receive one or more selection functions. Phrase selection module 106 can apply the selection functions to the cross entropies to select a subset of phrases for inclusion in the key phrase data structure for the document. Selection functions can include weighting functions and/or threshold functions. Selected phrases can be copied to the key phrase data structure for the document.

FIG. 2 illustrates a flow chart of an example method 200 for identifying key phrases within documents. Method 200 will be described with respect to the components and data in computer architecture 100.

Method 200 includes an act of accessing a document (act 201). For example, frequency calculation module 102 can access document 112. Method 200 includes an act of calculating the frequency of occurrence of a plurality of different textual phrases within the document, each textual phrase including one or more individual words of a specified language (act 202). For example, frequency calculation module 102 can calculate the frequency of occurrence of a plurality of textual phrases, such as, for example, phrases 131, 132, and 133, within document 112. Each textual phrase in document 112 can include one or more individual words of a specified language (e.g., English, Japanese, Chinese, languages of India, etc.).

A frequency for a phrase can represent how often a phrase occurs in document 112. For example, frequency 141 represents how often phrase 131 occurs in document 112, frequency 142 represents how often phrase 132 occurs in document 112, frequency 143 represents how often phrase 133 occurs in document 112, etc. Frequency calculation module 102 can calculate frequencies for other additional phrases within document 112. Frequency calculation module 102 can send the phrases and corresponding frequencies to cross-entropy calculation module 103. Cross-entropy calculation module 103 can receive the phrases and corresponding frequencies from frequency calculation module 102.

Method 200 includes an act of accessing a language model for the specified language, the language model defining expected frequencies of occurrence at least for individual words of the specified language (act 203). For example, cross-entropy calculation module 103 can access statistical language model 159. Statistical language model 159 can define expected frequencies of occurrence for words of the language of document 112. For example, word 161 has expected frequency 171, word 162 has expected frequency 172, etc.

For each textual phrase in the plurality of different textual phrases, method 200 includes an act of computing a cross-entropy value for the textual phrase, the cross-entropy value computed from the frequency of occurrence of the textual phrase within the document and the frequency of occurrence of the textual phrase within the specified language (act 204). For example, cross-entropy calculation module 103 can compute a cross-entropy value for phrases from document 112, such as, for example, phrases 131, 132, 133, etc. Cross-entropy for phrases 131, 132, 133, etc., can be computed from frequencies 141, 142, 143, etc., and expected frequencies 171, 172, etc. For phrases that occur more frequently than expected, cross-entropy can be increased. On the other hand, for phrases that occur less frequently than expected, cross-entropy can be decreased.

When appropriate, combiner 104 can compute an expected frequency for a phrase from expected frequencies for one or more words included in the phrase.

In some embodiments, cross entropy is computed in accordance with the following pseudo code example (where an ngram represents a phrase):

    languageModel = SelectLanguageModel(document)
    candidates = empty topN priority queue;
    foreach ((ngram, locations) in DNI[document])
    {
      score = ComputeCrossEntropy(
          document.GetSize(),
          locations.Length,                // actual ngram frequency in current document
          languageModel.GetLogProb(ngram)  // expected ngram logprob from language model
      );
      candidates.Add(ngram, score);
    }

wherein

    ComputeCrossEntropy(numWordsInDocument, numOccurences, logprob)
    {
      // we reward repeated occurrences
      BoostMultiplier = 20
      if (numOccurences > 1): numOccurences *= BoostMultiplier
      observedLogprob = Log10(numOccurences/numWordsInDocument)
      rawWeight = logprob/observedLogprob
      // smoothen the result to better cover the 0-1 range.
      result = (((maxWeightCommonRange - minWeightCommonRange) /
                 (maxLogprobCommonRange - minLogprobCommonRange)) *
                (rawWeight - minLogprobCommonRange)) +
               minWeightCommonRange
      if result < 0: result = 0
      if result > 1: result = 1
      return result
    }

In some embodiments, values for one or more of minWeightCommonRange and maxWeightCommonRange are selected to linearize results. For example, minWeightCommonRange (=0.1) and maxWeightCommonRange (=0.9) can be used to denote the “common range” of values (0.1-0.9), while the “leftovers” from 0-1 (0-0.1, and 0.9-1) are left for extreme values.

In some embodiments, minLogprobCommonRange and maxLogprobCommonRange are calculated from experimental results. For example, minLogprobCommonRange and maxLogprobCommonRange can be experimentally calculated as 2 and 12, respectively (a range where the values for the rawWeight are commonly included).

The pseudo code can be used to measure and reward the “amount of surprise” that each n-gram (phrase) has in the context of a given document. That is, the more frequent an n-gram is in comparison with its expected frequency, the more weight it carries in that document.

This amount of surprise can more crudely be measured as actualFrequency/expectedFrequency. However, the ComputeCrossEntropy function provides a more sophisticated measurement that accounts for document length. The ComputeCrossEntropy function balances credit for very short and very long documents. For example, the ComputeCrossEntropy function is configured to not give too much credit to very short documents nor steal too much credit from very long documents.
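The ComputeCrossEntropy pseudo code can be transcribed into runnable Python roughly as follows. This is a sketch, using the example constants mentioned in this description (a boost multiplier of 20, a weight range of 0.1-0.9, and a logprob range of 2-12). The handling of a zero observed log probability is an added assumption, since the pseudo code does not cover that case.

```python
import math

BOOST_MULTIPLIER = 20
MIN_WEIGHT_COMMON_RANGE = 0.1
MAX_WEIGHT_COMMON_RANGE = 0.9
MIN_LOGPROB_COMMON_RANGE = 2.0
MAX_LOGPROB_COMMON_RANGE = 12.0


def compute_cross_entropy(num_words_in_document, num_occurrences, logprob):
    """Return a weight in [0, 1] for a phrase.

    logprob is the expected log10 probability of the phrase from the
    language model (a negative number for any probability below 1).
    """
    if num_occurrences > 1:
        num_occurrences *= BOOST_MULTIPLIER  # reward repeated occurrences
    observed_logprob = math.log10(num_occurrences / num_words_in_document)
    if observed_logprob == 0:
        # Assumption: not covered by the pseudo code. Avoid dividing by zero
        # by treating this degenerate case as the maximum weight.
        return 1.0
    raw_weight = logprob / observed_logprob
    # Map the common raw-weight range linearly onto the common weight range.
    slope = (MAX_WEIGHT_COMMON_RANGE - MIN_WEIGHT_COMMON_RANGE) / (
        MAX_LOGPROB_COMMON_RANGE - MIN_LOGPROB_COMMON_RANGE)
    result = slope * (raw_weight - MIN_LOGPROB_COMMON_RANGE) + MIN_WEIGHT_COMMON_RANGE
    return min(1.0, max(0.0, result))


if __name__ == "__main__":
    # A phrase seen 5 times in a 1000-word document, expected log10 probability -6.
    print(compute_cross_entropy(1000, 5, -6.0))
```

One property of the formula as written: if the boosted count exceeds the document length, the observed log probability becomes positive, the ratio turns negative, and the result clamps to 0. The description does not say whether that behavior is intended.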

Method 200 includes an act of selecting a specified number of statistically significant textual phrases from within the document based on the computed cross-entropy values (act 205). For example, cross-entropy calculation module 103 can return a maximum number of top candidates based on computed cross-entropies. The number of top candidates can be all or some number less than all of the phrases contained in document 112, such as, for example, phrases 131, 132, 133, etc. Cross-entropy calculation module 103 can output the number of top candidates along with their corresponding cross-entropy values to phrase selector 106. For example, phrase 131 can be output with cross-entropy 151, phrase 132 can be output with cross-entropy 152, phrase 133 can be output with cross-entropy 153, etc. Phrase selector 106 can receive the number of top candidates along with their corresponding cross-entropy values from cross-entropy calculation module 103.

Phrase selector 106 can apply selection functions 158 to filter out one or more of the top candidates. Selection functions 158 can include weighting and/or threshold functions. Weighting functions can be used to rank phrase relevance (based on cross-entropy) in a key phrase data structure. Weighting functions can also provide a sufficiently detailed sort order with respect to both document similarity and phrase relevance. Threshold functions allow a key phrase data structure to be maintained in a lossy state. Threshold functions can be used to prune out phrases that have a cross-entropy under a specified cross-entropy threshold for a document.

Various different types of free parameters, such as, for example, cross-entropy/log probability, term frequency, document length, etc., can be used in selection functions. Functional forms for selection functions can be selected arbitrarily. For example, some possible types of weighting functions include:

    Functional form    Example
    Linear             f(.) = ax1 + bx2 + c
    Polynomial         f(.) = ax1^(n) + bx2^(n-1)
    Ratio              f(.) = ax1^(n)/bx2^(m)
    Exponential        2^(f(.)), e^(f(.))

Similarly, threshold functions can be of the form: f(.)<T, or of the form f(.)/g(.)<T %.
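To illustrate how a weighting function and a threshold function of these forms might be combined, the following Python sketch applies a linear weighting f(.) = ax1 + bx2 + c, prunes phrases whose weight falls below a threshold T, and returns the survivors in decreasing order of weight. The choice of cross-entropy and term frequency as x1 and x2, the coefficient values, and the function names are assumptions made for illustration only.

```python
def linear_weight(cross_entropy, term_frequency, a=1.0, b=0.01, c=0.0):
    """Linear weighting function f(.) = a*x1 + b*x2 + c (coefficients are made up)."""
    return a * cross_entropy + b * term_frequency + c


def select_phrases(candidates, threshold, max_results):
    """Weight, prune, and rank candidates.

    candidates: iterable of (phrase, cross_entropy, term_frequency) tuples.
    Returns up to max_results (phrase, weight) pairs, most relevant first.
    """
    weighted = [(phrase, linear_weight(ce, tf)) for phrase, ce, tf in candidates]
    kept = [(phrase, w) for phrase, w in weighted if w >= threshold]  # prune f(.) < T
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_results]


if __name__ == "__main__":
    sample = [("heart attack", 0.8, 5), ("the", 0.05, 40), ("clogging", 0.6, 2)]
    for phrase, weight in select_phrases(sample, threshold=0.5, max_results=10):
        print(phrase, round(weight, 2))
```

Because pruned phrases are discarded rather than stored, the resulting key phrase data is lossy in the sense described above: the least relevant phrase that survives still meets the threshold.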

When both weighting and threshold functions are applied, it may be that phrase selector 106 outputs a set of phrases sorted from more relevant to less relevant, wherein the least relevant phrase retains a threshold relevance. For example, phrase selector 106 can output one or more phrases from document 112 such as, for example, phrases 132, 191, 192, etc.

Method 200 includes an act of populating a key phrase data structure with data representative of each of the selected specified number of statistically significant textual phrases (act 206). For example, phrase selector 106 can populate key phrase data structure 107 with phrases 132, 191, 192, etc. Phrases may or may not be stored along with a corresponding weight in a key phrase data structure. For a specified document, a key phrase data structure can be of the non-normalized format:

    Tags: heart (w1), attack (w2), clogging (w3), PID:99 (w4)

or of the normalized format:

    Tag         Weight
    heart       w1
    attack      w2
    clogging    w3
    PID:99      w4

When a plurality of documents are processed (e.g., documents 112, 122, etc.), a document ID (e.g., document ID 111, 121, etc.) can travel along with each phrase to indicate the document where each phrase originated. In these embodiments, a key phrase data structure can be of the non-normalized format:

    Doc Id    Tags
    218       heart (w1), attack (w2), clogging (w3), PID:99 (w4)

or of the normalized format:

    Doc Id    Tag         Weight
    218       heart       w1
    218       attack      w2
    218       clogging    w3
    218       PID:99      w4
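The two layouts carry the same information, which the following Python sketch illustrates by converting between them. The function names and the numeric weights standing in for w1-w4 are hypothetical; the description does not prescribe a storage API.

```python
def normalize(doc_id, tags):
    """Expand one non-normalized record into (doc_id, tag, weight) rows."""
    return [(doc_id, tag, weight) for tag, weight in tags]


def denormalize(rows):
    """Group (doc_id, tag, weight) rows into {doc_id: [(tag, weight), ...]}."""
    grouped = {}
    for doc_id, tag, weight in rows:
        grouped.setdefault(doc_id, []).append((tag, weight))
    return grouped


if __name__ == "__main__":
    # Placeholder weights standing in for w1..w4.
    tags = [("heart", 0.9), ("attack", 0.8), ("clogging", 0.6), ("PID:99", 0.4)]
    rows = normalize(218, tags)
    print(rows)
    print(denormalize(rows))
```

The normalized form (one row per document and tag) is generally the easier one to index and join on in a relational table, while the non-normalized form keeps all tags for a document in a single row.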

FIG. 3 illustrates an example computer architecture 300 that facilitates identifying key phrases within documents. Referring to FIG. 3, computer architecture 300 includes database 301, location indexer 302, keyword extractor 303, ranking module 306, and key phrase data structure 307. Each of the depicted computer systems is connected to one another over (or is part of) a network, such as, for example, a Local Area Network (“LAN”), a Wide Area Network (“WAN”), and even the Internet. Accordingly, each of the depicted components as well as any other connected computer systems and their components, can create message related data and exchange message related data (e.g., Internet Protocol (“IP”) datagrams and other higher layer protocols that utilize IP datagrams, such as, Transmission Control Protocol (“TCP”), Hypertext Transfer Protocol (“HTTP”), Simple Mail Transfer Protocol (“SMTP”), etc.) over the network.

Database 301 can be virtually any type of database (e.g., a Structured Query Language (“SQL”) database or other relational database). As depicted, database 301 can contain one or more tables including table 309. Each table in database 301 can include one or more rows and one or more columns used to organize data, such as, for example, documents. For example, table 309 includes a plurality of documents including documents 312 and 322. Each document can be identified by a corresponding document ID. For example, document ID 311 can identify document 312, document ID 321 can identify document 322, etc.

Location indexer 302 is configured to identify one or more locations within a document where phrases are located.

Keyword extractor 303 is configured to score key phrases from a document based on a location list for the key phrases relative to the occurrence of phrases in a training data set. A training data set can be used at keyword extractor 303 to produce a model for a supported language. In some embodiments, a phrase is used as a query term submitted to a search engine. Web pages returned in the search results from the query term are used as training data for the phrase. Training for a language can occur in accordance with the following pseudo code (where an ngram represents a phrase):

    store = InitializeModel(language)
    // set of documents and associated keyphrases
    trainingSet = empty Dictionary<document, Set<ngram>>
    foreach (language in SetOfLanguagesWeSupport)
    {
      foreach ((ngram, frequency) in TrainingLanguageModel(language))
      {
        // seeding the store with the language model frequencies
        store.Add(ngram, frequency)
      }
      // SelectSampleOf is selecting about 10000 ngrams from the language model to issue queries for
      foreach (ngram in SelectSampleOf(source = TrainingLanguageModel(language)))
      {
        // we only need about 10000 training documents
        if (trainingSet.Length >= 10000) break;
        // we only retain the top URL that matches our needs
        URL document = QuerySearchEngine(ngram);
        keyphrases = new Set<ngram>();
        keyphrases.Add(ngram); // add the query as a keyphrase
        trainingSet.Insert(document, keyphrases)
      }
      // parse the documents, add contained ngrams as keyphrases
      foreach ((document, keyphrases) in trainingSet)
      {
        foreach (ngram in document)
        {
          trainingSet[document].Add(ngram)
        }
      }
      // process the training set and build the KEX model
      // this part is generic, can take as input any training set, regardless if it was
      // produced from querying search engine or is a manually tagged set of documents
      foreach ((document, keyphrases) in trainingSet)
      {
        foreach (keyphrase in keyphrases)
        {
          // it is a bit more complex than this, because we need to differentiate between
          // keyphrases that were used as queries vs the ones that were only found inside
          // the doc, etc.
          store.Update(document, keyphrase)
        }
      }
    }

Keyword extractor 303 can run phrases and corresponding location lists against the model to extract phrases from a document. Keywords can be extracted in accordance with the following pseudocode (for a document in a given language and where an ngram represents a phrase):

store = ChooseModel(language)
features = empty collection
foreach ((ngram, locations) in DNI[document])
{
  if (ngram is not in store) continue;
  storedFeatures = store.GetFeatures(ngram);
  foreach (location in locations)
  {
    dynamicFeatures = ComputeFeatures(location, ngram);
    features.Insert(ngram, join(storedFeatures, dynamicFeatures));
  }
}
candidates = empty dictionary;
foreach (ngram in features.Keys)
{
  // this uses the predictive-model part of KEX trained model
  score = RelevanceScore(features[ngram]);
  if (score > threshold)
  {
    candidates.Add(ngram, score);
  }
}
return maxResults top-score candidates in score-decreasing order;
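The extraction pseudocode above can be illustrated with a small Python sketch. The feature choices in `compute_features` and the formula in `relevance_score` are assumptions made only for illustration; the trained predictive model that the pseudocode refers to is not specified here, so the scoring function below is a hypothetical stand-in.

```python
def compute_features(location, ngram, document_length):
    """Dynamic features for one occurrence (illustrative choices only)."""
    return {
        "relative_position": location / max(document_length, 1),
        "word_count": len(ngram.split()),
    }


def relevance_score(stored, occurrences):
    """Hypothetical stand-in for the trained predictive model.

    Rewards phrases that are rare in the language model, occur often in
    the document, and first appear early in the document.
    """
    rarity = 1.0 / (1.0 + stored["frequency"])
    earliest = min(f["relative_position"] for f in occurrences)
    return rarity * len(occurrences) * (1.0 - earliest)


def extract_keyphrases(location_index, store, document_length,
                       threshold=0.0, max_results=3):
    """Return up to max_results (ngram, score) pairs, best first."""
    candidates = {}
    for ngram, locations in location_index.items():
        if ngram not in store:
            continue  # unknown to the model, so it cannot be scored
        occurrences = [compute_features(loc, ngram, document_length)
                       for loc in locations]
        score = relevance_score(store[ngram], occurrences)
        if score > threshold:
            candidates[ngram] = score
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:max_results]
```

As in the pseudocode, phrases missing from the store are skipped, phrases at or below the threshold are dropped, and the survivors are returned in score-decreasing order.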

Ranking module 306 is configured to receive phrases and corresponding scores and rank the phrases in accordance with the scores. Ranking module 306 can store the ranked phrases in key phrase data structure 307.

FIG. 4 illustrates a flow chart of an example method 400 for identifying key phrases within documents. Method 400 will be described with respect to the components and data in computer architecture 300.

Method 400 includes an act of accessing a document containing a plurality of textual phrases (act 401). For example, location indexer 302 can access document 312. Document 312 can contain a plurality of textual phrases, such as, for example, phrases 331, 332, 333, etc.

For each textual phrase in the plurality of textual phrases contained in the document, method 400 includes an act of generating a location list for the textual phrase, the location list indicating one or more locations of the textual phrase within the document (act 402). For example, location indexer 302 can generate locations list 341 for phrase 331. Locations list 341 indicates one or more locations within document 312 where phrase 331 is found. Similarly, location indexer 302 can generate locations list 342 for phrase 332. Locations list 342 indicates one or more locations within document 312 where phrase 332 is found. Likewise, location indexer 302 can generate locations list 343 for phrase 333. Locations list 343 indicates one or more locations within document 312 where phrase 333 is found. Location lists for other phrases in document 312 can also be generated.
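As a minimal sketch of act 402, the following Python function builds a location list for every phrase in a document. It assumes, purely for illustration, that phrases are word n-grams obtained by whitespace splitting and that a location is a zero-based word offset; the word breaking and location encoding actually used by location indexer 302 may differ.

```python
from collections import defaultdict


def build_location_lists(text, max_n=3):
    """Map each word n-gram (1..max_n words) to its word offsets in text."""
    words = text.lower().split()
    locations = defaultdict(list)
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            phrase = " ".join(words[start:start + n])
            locations[phrase].append(start)
    return dict(locations)
```

For example, in the text "key phrase extraction finds each key phrase" the phrase "key phrase" receives the location list [0, 5], because it starts at word offsets 0 and 5.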

Location indexer 302 can send phrases and corresponding locations lists to keyword extractor 303. Keyword extractor 303 can receive phrases and corresponding locations lists from location indexer 302.

For each textual phrase in the plurality of textual phrases contained in the document, method 400 includes an act of assigning a score to the textual phrase based on the contents of the location list for the textual phrase relative to the occurrence of the textual phrase in a training set of data (act 403). For example, keyword extractor 303 can assign score 351 to phrase 331 based on the contents of locations list 341 relative to the occurrence of phrase 331 in training data 359. Similarly, keyword extractor 303 can assign score 352 to phrase 332 based on the contents of locations list 342 relative to the occurrence of phrase 332 in training data 359. Likewise, keyword extractor 303 can assign score 353 to phrase 333 based on the contents of locations list 343 relative to the occurrence of phrase 333 in training data 359. Scores for other phrases (e.g., phrases 393 and 394) can also be assigned.

Keyword extractor 303 can send phrases and corresponding scores to ranking module 306. Ranking module 306 can receive phrases and corresponding scores from keyword extractor 303.

Method 400 includes an act of ranking the plurality of textual phrases according to the assigned scores (act 404). For example, ranking module 306 can sort phrases 331, 332, 333, etc. according to assigned scores 351, 352, 353, etc. In some embodiments, ranking module 306 sorts phrases based on assigned scores such that phrases with similar relevancy to document 312 are grouped together.
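One possible reading of act 404 is sketched below in Python: phrases are sorted by descending score, and adjacent phrases whose scores fall into the same fixed-width bucket are grouped together. The bucket-based notion of "similar relevancy" and the `bucket_width` parameter are assumptions of this sketch, not something the description specifies.

```python
def rank_and_group(scores, bucket_width=0.25):
    """Sort phrases by descending score and group those with close scores.

    scores: dict mapping phrase -> score.
    Returns a list of groups (lists of phrases), most relevant group first.
    """
    # Ties are broken alphabetically so the ordering is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    groups = []
    for phrase, score in ranked:
        bucket = int(score // bucket_width)
        if groups and groups[-1][0] == bucket:
            groups[-1][1].append(phrase)
        else:
            groups.append((bucket, [phrase]))
    return [phrases for _, phrases in groups]
```

Because the phrases are already sorted before bucketing, phrases with similar scores end up adjacent, so a single pass is enough to form the groups.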

Method 400 includes an act of selecting a subset of the plurality of textual phrases from within the document based on rankings (act 405). For example, ranking module 306 can select phrases 332, 393, 394, etc., from within document 312 based on rankings. Method 400 includes an act of populating a key phrase data structure with the selected subset of the plurality of textual phrases (act 406). For example, ranking module 306 can populate key phrase data structure 307 with phrases 332, 393, 394, etc.

When a plurality of documents are processed (e.g., documents 312, 322, etc.), a document ID (e.g., document ID 311, 321, etc.) can travel along with each phrase to indicate the document where each phrase originated.
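Acts 405 and 406, together with the document ID that travels with each phrase, can be sketched as follows. The tuple layout `(document_id, phrase, score)`, the dictionary used as the key phrase data structure, and the `top_n` parameter are assumptions chosen for illustration.

```python
from collections import defaultdict


def populate_key_phrases(scored_phrases, top_n=2):
    """Build {document_id: [phrase, ...]} from (document_id, phrase, score).

    Each tuple carries its document ID, so tuples from many documents can
    be processed in one stream and still be attributed to their source.
    """
    by_document = defaultdict(list)
    for document_id, phrase, score in scored_phrases:
        by_document[document_id].append((score, phrase))

    key_phrases = {}
    for document_id, entries in by_document.items():
        # Highest score first; ties broken alphabetically by phrase.
        entries.sort(key=lambda e: (-e[0], e[1]))
        key_phrases[document_id] = [phrase for _, phrase in entries[:top_n]]
    return key_phrases
```

Carrying the document ID inside every tuple means the tuples for different documents can be interleaved in any order without losing track of which document each key phrase summarizes.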

The present invention extends to methods, systems, and computer program products for identifying key phrases within documents. Embodiments of the invention include using a tag index to determine what a document primarily relates to. For example, an integrated data flow and extract-transform-load pipeline crawls, parses, and word breaks large corpuses of documents in database tables. Documents can be broken into tuples. The tuples can be sent to a heuristically based algorithm that uses statistical language models and weight plus cross-entropy threshold functions to summarize the document into its “top N” most statistically significant phrases. Accordingly, embodiments of the invention scale efficiently (e.g., linearly) and (potentially large numbers of) documents can be characterized by salient and relevant key phrases (tags).

The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

What is claimed is:
 1. At a computing system including one or more processors and system memory, a method implemented by the computing system for identifying key phrases within a document, the method comprising: an act of accessing a document containing a plurality of textual phrases; for each textual phrase in the plurality of textual phrases contained the document: an act of generating a location list for the textual phrase, the location list indicating one or more locations of the textual phrase within the document; an act of assigning a score to the textual phrase based on the contents of the location list for the textual phrase relative to the occurrence of the textural phrase in a training set of data; an act of ranking the plurality of textual phrases according to the assigned scores; an act of selecting a subset of the plurality of textual phrases from within the document based on rankings; and an act of populating a key phrase data structure the selected subset of the plurality of textual phrases.
 2. The method as recited in claim 1, further comprising an act of generating the training set of data through a plurality of queries to a search engine.
 3. The method as recited in claim 1, wherein the training set of data is a language model.
 4. The method as recited in claim 1, wherein the ranking includes sorting a plurality of textual phrases associated with the document based on assigned scores such that textual phrases that are determined to have a similar relevancy to the document are grouped together.
 5. The method as recited in claim 1, wherein the method further includes appending the document with a document identifier.
 6. The method as recited in claim 5, wherein the document identifier indicates where the textual phrase occurs in the document.
 7. The method as recited in claim 1, wherein the method includes identifying a set of one or more most statistically significant textual phrases in the document.
 8. A computer program product for use at a computing system, the computer program product comprising one or more computer storage devices having stored thereon computer-executable instructions that, when executed at a processor, cause the computing system to perform a method for identifying key phrases within a document, wherein the method includes the computing system performing the following: an act of accessing a document containing a plurality of textual phrases; for each textual phrase in the plurality of textual phrases contained the document: an act of generating a location list for the textual phrase, the location list indicating one or more locations of the textual phrase within the document; an act of assigning a score to the textual phrase based on the contents of the location list for the textual phrase relative to the occurrence of the textural phrase in a training set of data; an act of ranking the plurality of textual phrases according to the assigned scores; an act of selecting a subset of the plurality of textual phrases from within the document based on rankings; and an act of populating a key phrase data structure the selected subset of the plurality of textual phrases.
 9. The computer program product as recited in claim 8, further comprising an act of generating the training set of data through a plurality of queries to a search engine.
 10. The computer program product as recited in claim 8, wherein the training set of data is a language model.
 11. The computer program product as recited in claim 8, wherein the ranking includes sorting a plurality of textual phrases associated with the document based on assigned scores such that textual phrases that are determined to have a similar relevancy to the document are grouped together.
 12. The computer program product as recited in claim 8, wherein the method further includes appending the document with a document identifier and wherein the document identifier indicates where the textual phrase occurs in the document.
 13. The computer program product as recited in claim 8, wherein the method includes identifying a set of one or more statistically significant textual phrases in the document and using a tag index to identify what the document primarily relates to based on the one or more most statistically significant textual phrases in the document.
 14. A computing system comprising: at least one processor; and one or more computer-readable media having stored computer-executable instructions that, when executed by the at least one processor, cause the computing system to perform a method for identifying key phrases within a document, wherein the method includes the computing system performing the following: an act of accessing a document containing a plurality of textual phrases; for each textual phrase in the plurality of textual phrases contained the document: an act of generating a location list for the textual phrase, the location list indicating one or more locations of the textual phrase within the document; an act of assigning a score to the textual phrase based on the contents of the location list for the textual phrase relative to the occurrence of the textural phrase in a training set of data; an act of ranking the plurality of textual phrases according to the assigned scores; an act of selecting a subset of the plurality of textual phrases from within the document based on rankings; and an act of populating a key phrase data structure the selected subset of the plurality of textual phrases.
 15. The computing system as recited in claim 14, further comprising an act of generating the training set of data through a plurality of queries to a search engine.
 16. The computing system as recited in claim 14, wherein the training set of data is a language model.
 17. The computing system as recited in claim 14, wherein the ranking includes sorting a plurality of textual phrases associated with the document based on assigned scores such that textual phrases that are determined to have a similar relevancy to the document are grouped together.
 18. The computing system as recited in claim 14, wherein the method further includes appending the document with a document identifier and wherein the document identifier indicates where the textual phrase occurs in the document.
 19. The computing system as recited in claim 14, wherein the method includes identifying a set of one or more statistically significant textual phrases in the document and using a tag index to identify what the document primarily relates to based on the one or more most statistically significant textual phrases in the document.
 20. The computing system as recited in claim 14, wherein the one or more computer-readable media comprises system memory.